
graph : remove redundant GDN state transposes #20443

Merged
ggerganov merged 6 commits into master from gg/gdn-fix-state-transpose on Mar 13, 2026

Conversation

@ggerganov
Member

@ggerganov ggerganov commented Mar 12, 2026

cont #20437
fix #20436

As correctly noted in #20436, there is no need to transpose the recurrent state in the GDN computation. This simplifies the ggml graph for the unfused path and optimizes the fused kernels with coalesced reads/writes.
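A minimal host-side C sketch (illustrative, not the actual ggml code) of why the transpose node is redundant: for a row-major state matrix, element (i, j) of the transposed view is simply element (j, i) of the original, so the read site can swap indices instead of materializing a transpose:

```c
#include <stddef.h>

enum { S = 8 };  /* toy state size; the real GDN state is S_v x S_v */

/* materialize the transpose (what an explicit transpose node in the graph does) */
static void transpose(const float *src, float *dst) {
    for (int i = 0; i < S; ++i)
        for (int j = 0; j < S; ++j)
            dst[(size_t)i * S + j] = src[(size_t)j * S + i];
}

/* read element (i, j) of the transposed view without materializing it */
static float read_transposed(const float *src, int i, int j) {
    return src[(size_t)j * S + i];
}
```

With the indices swapped at the read site, the graph drops one transpose node (and the copy it implies) per layer.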

TODOs:

@ggerganov ggerganov requested a review from CISC as a code owner March 12, 2026 06:53
@github-actions bot added labels: model (Model specific), Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning), Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)) on Mar 12, 2026
@0cc4m
Contributor

0cc4m commented Mar 12, 2026

Should this go in first, or the Vulkan implementation? It's going to be stuck waiting for CI forever if we have to keep adapting it before it has a chance to finish. Otherwise I'll have to merge it without waiting for CI.

@ggerganov
Member Author

@0cc4m Yes, proceed with the Vulkan implementation. Sorry for the rapid changes.

@ORippler
Collaborator

ORippler commented Mar 12, 2026

This currently speeds up TG on NVGPU by coalescing the writes (in TG we process n_token=1 and afterwards write the states back, whereas in PP we process ubatch tokens and write the states back at the end). Uncoalesced reads are not as important with big caches, and I thus saw no thrashing occur (though this may happen on pre-Ada GPUs). Further PP improvements will have to revolve around batching data accesses (either process > 1 col per warp and do wider loads, or unroll n_tokens). A WIP for the latter is #20449, as I'm not super familiar with the data-layout/alignment requirements of the CUDA op.
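As a rough model of the coalescing effect described above (a sketch with assumed transaction sizes, not the actual kernel): when 32 warp lanes each load one float, a stride-1 access pattern touches a single 128-byte line, while a stride-S_v pattern touches one line per lane:

```c
enum { WARP = 32, CACHE_LINE = 128 };  /* bytes; typical NVIDIA transaction size */

/* count distinct 128-byte lines touched when each warp lane loads one float
 * at the given element stride (lane addresses are monotonically increasing) */
static int lines_touched(int elem_stride) {
    int count = 0;
    long prev = -1;
    for (int lane = 0; lane < WARP; ++lane) {
        long line = ((long)lane * elem_stride * 4) / CACHE_LINE;  /* 4 = sizeof(float) */
        if (line != prev) { ++count; prev = line; }
    }
    return count;
}
```

Here lines_touched(1) is 1, while lines_touched(128) (stride S_v = 128 floats = 512 bytes) is 32, i.e. 32x the memory traffic for the same 128 bytes of useful data.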

@ORippler
Collaborator

Relevant NCU sections for TG (screenshots)

Relevant NCU sections for PP (screenshots)

arkavo-com and others added 4 commits March 13, 2026 18:12
…20436)

The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix
column-wise on row-major storage, causing strided reads (stride S_v =
128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a
39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused
path.

Transpose the state indexing so threads read contiguously:
- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA:  curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU:   restructured loops for row-wise transposed access

Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so
users can control fused GDN independently of auto-detection.

All GATED_DELTA_NET backend-ops tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ed flags

- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized
  dot products in the CPU fused GDN kernel (delta and attention output)
- Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one
  path lacks device support, disable both to prevent state layout mismatch
  between transposed (fused) and non-transposed (unfused) formats

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
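A sketch of the scalar-to-dot-product rewrite described in the commit above. vec_dot_f32 here is a simplified stand-in: the real ggml_vec_dot_f32 takes additional stride/count arguments and uses explicit SIMD, whereas this version just exposes a contiguous loop the compiler can auto-vectorize:

```c
/* simplified stand-in for ggml's SIMD dot-product helper */
static float vec_dot_f32(int n, const float *x, const float *y) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}

/* before: a scalar inner loop accumulating the delta term element by element;
 * after: one contiguous dot product per state row, which is the pattern the
 * transposed (row-wise) state layout makes possible */
static float delta_for_row(int n, const float *state_row, const float *k) {
    return vec_dot_f32(n, state_row, k);
}
```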
@ggerganov force-pushed the gg/gdn-fix-state-transpose branch from 7ea6ee4 to fe3ef4a on March 13, 2026 16:13
@ggerganov ggerganov requested a review from 0cc4m as a code owner March 13, 2026 16:48
@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) label on Mar 13, 2026
@ORippler
Collaborator

We may consider removing the obsolete smem code in this PR (we unified the HIP/CUDA paths in #20340 via #20391 and removed the SMEM as part of that). Can also file a separate PR if preferred.

diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
index eb3858820..1ce6d5f31 100644
--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
@@ -135,15 +135,6 @@ __global__ void gated_delta_net_cuda(const float * q,
     }
 }
 
-static size_t calculate_smem(const int sv, int cc)
-{
-    size_t smem = 0;
-    if ((GGML_CUDA_CC_IS_AMD(cc) && !GGML_CUDA_CC_IS_RDNA3(cc) && !GGML_CUDA_CC_IS_RDNA4(cc)) || GGML_CUDA_CC_IS_MTHREADS(cc)) {
-        smem = sv * sv * sizeof(float);
-    }
-    return smem;
-}
-
 template <bool KDA>
 static void launch_gated_delta_net(
         const float * q_d, const float * k_d, const float * v_d,
@@ -180,18 +171,14 @@ static void launch_gated_delta_net(
                 sb1, sb2, sb3, neqk1_magic, rq3_magic, scale);
             break;
         case 64: {
-            constexpr int sv = 64;
-            size_t smem = calculate_smem(sv, cc);
-            gated_delta_net_cuda<sv, KDA><<<grid_dims, block_dims, smem, stream>>>(
+            gated_delta_net_cuda<64, KDA><<<grid_dims, block_dims, 0, stream>>>(
                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
                 sb1, sb2, sb3, neqk1_magic, rq3_magic, scale);
             break;
         }
         case 128: {
-            constexpr int sv = 128;
-            size_t smem = calculate_smem(sv, cc);
-            gated_delta_net_cuda<sv, KDA><<<grid_dims, block_dims, smem, stream>>>(
+            gated_delta_net_cuda<128, KDA><<<grid_dims, block_dims, 0, stream>>>(
                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
                 sb1, sb2, sb3, neqk1_magic, rq3_magic, scale);

@ggerganov ggerganov merged commit e30f1fd into master Mar 13, 2026
16 of 76 checks passed
@ggerganov ggerganov deleted the gg/gdn-fix-state-transpose branch March 13, 2026 20:12
@jeffbolznv
Collaborator

This change made qwen3.5 about 10% slower for tg in the Vulkan backend, with CPY getting 3x more expensive. We may be hitting some slow path; I'll look into it.

@0cc4m
Copy link
Contributor

0cc4m commented Mar 13, 2026

I saw something like that, but thought it best to wait until we get some more shader improvements in.

@jeffbolznv
Collaborator

I don't think it was the copy shader, some driver issue was giving me bad timestamps. The timings I'm seeing now show the GDN op several times slower, but I'm not totally confident in the data.

@pedapudi

pedapudi commented Mar 13, 2026

This PR resolved a tg regression on Strix Halo for Qwen MoE models using the ROCm backend. Thanks for the PR, and to the original reporter for the issue.

It seems that gfx1151 is particularly sensitive to the changes made for GDN support, perhaps due to its susceptibility to register pressure.

I'm wondering whether pointed testing for that architecture specifically is possible, given the apparent sensitivity. I fully acknowledge the combinatorial explosion of needing to test a wide variety of architectures, and the slippery slope, but this is a pointed request specifically for GDN-related changes, given the apparent platform sensitivity and the relative popularity of the architecture. Ideally, GDN-related updates in llama.cpp would improve performance across the board, but avoiding regressions would be a nice baseline :)

Will defer to @IMbackK for insights on change management in relation to AMD architectures.

Cheers.



Successfully merging this pull request may close these issues.

Fused GDN kernel: column-wise state access causes cache thrashing on Metal/CUDA/CPU

7 participants